feat: make ICU the default FTS tokenizer #6968

Open: Xuanwo wants to merge 4 commits into main from xuanwo/icu-default-fts-tokenizer

Xuanwo (Collaborator) commented May 27, 2026

This changes the default native FTS tokenizer from simple to icu so new inverted indexes handle mixed-language text without requiring users to opt into multilingual tokenization. Legacy missing tokenizer metadata continues to resolve to simple, and builds without the ICU feature still fall back to simple.
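The resolution rules described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `TokenizerMeta`, `new_index_meta`, `resolve_existing`, and the `icu_enabled` parameter are hypothetical names, and the real `default_base_tokenizer` presumably reads a compile-time feature flag rather than taking a boolean argument.

```rust
/// Hypothetical stand-in for the tokenizer metadata stored with an index.
/// `None` models a legacy index written before the field existed.
#[derive(Debug, Clone, PartialEq, Eq)]
struct TokenizerMeta {
    base_tokenizer: Option<String>,
}

/// Default for newly created indexes.
/// Assumption: `icu_enabled` stands in for the ICU cargo feature being compiled in.
fn default_base_tokenizer(icu_enabled: bool) -> &'static str {
    if icu_enabled {
        "icu"
    } else {
        // Builds without ICU fall back to the simple tokenizer.
        "simple"
    }
}

/// Metadata written for a new index: the default is recorded explicitly.
fn new_index_meta(icu_enabled: bool) -> TokenizerMeta {
    TokenizerMeta {
        base_tokenizer: Some(default_base_tokenizer(icu_enabled).to_string()),
    }
}

/// Tokenizer used when opening an existing index.
/// Legacy metadata with no tokenizer recorded keeps resolving to "simple",
/// regardless of what the current default is.
fn resolve_existing(meta: &TokenizerMeta) -> &str {
    meta.base_tokenizer.as_deref().unwrap_or("simple")
}
```

The point of keeping the two paths separate is that changing the default only affects indexes created after the change; an old index without tokenizer metadata is still read with the tokenizer it was built with.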

Benchmark summary from the 100M-row runs:

| Dataset | Retrieval impact | Build cost | Index size | Query latency |
| --- | --- | --- | --- | --- |
| English-only | Recall unchanged | +15.4% | +0.6% | Common terms flat; rare terms slightly slower but still small |
| Mixed-language | ZH / JP / TH rare recall improves from 0.0 to 1.0 | +20.4% | +25.7% | EN / FR flat; multilingual rare queries roughly flat |

The key tradeoff is that ICU adds modest overhead for English-only data, but it fixes default recall for unspaced Chinese, Japanese, and Thai text.
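To illustrate why unspaced text gets 0.0 recall with a whitespace-and-punctuation style tokenizer, here is a minimal sketch. It assumes the simple tokenizer behaves roughly like "split on non-alphanumeric characters, then lowercase"; `simple_tokens` and `matches_term` are hypothetical helpers written for this example, not Lance APIs, and ICU word segmentation itself is not shown.

```rust
/// Simplified stand-in for a "simple" tokenizer (an assumption about its
/// behavior): lowercase runs of alphanumeric characters.
fn simple_tokens(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Exact-term lookup, as an inverted index would do: the query term must
/// equal one of the indexed tokens.
fn matches_term(doc: &str, term: &str) -> bool {
    let term = term.to_lowercase();
    simple_tokens(doc).iter().any(|t| *t == term)
}
```

With this model, `matches_term("Full text search", "search")` is true, but a sentence such as `"我喜欢全文搜索"` contains no spaces or punctuation, so it is indexed as one long token and a query for `"搜索"` never matches it. A dictionary-based word segmenter splits such text into words, which is what lets the term lookup succeed.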

Detailed benchmark numbers

English-only 100M rows

| Metric | simple | icu | Difference |
| --- | --- | --- | --- |
| Build time | 15.31s | 17.66s | +15.4% |
| Index size delta | 964.0MB | 969.7MB | +0.6% |
| Common-term latency | 2370.1ms / 1179.4ms | 2372.5ms / 1188.8ms | Flat |
| Rare-term recall | 1.0 / 1.0 | 1.0 / 1.0 | No change |
| Rare-term latency | 11.7ms / 14.2ms | 23.3ms / 17.7ms | Slightly slower |

Mixed-language 100M rows

| Metric | simple | icu | Difference |
| --- | --- | --- | --- |
| Build time | 33.35s | 40.16s | +20.4% |
| Index size delta | 800.1MB | 1005.7MB | +25.7% |
| EN common latency | 1026.7ms | 953.8ms | Slightly faster |
| FR common latency | 945.9ms | 946.1ms | Flat |
| CJK common query | 0 rows / 1.1ms | 10 rows / 944.1ms | simple misses |
| ZH / JP / TH rare recall | 0.0 / 0.0 / 0.0 | 1.0 / 1.0 / 1.0 | ICU recovers recall |
| FR / EN rare recall | 1.0 / 1.0 | 1.0 / 1.0 | No change |
| Rare-query latency | 13.7-24.6ms | 11.1-23.1ms | Roughly flat |

github-actions bot added the enhancement and python labels May 27, 2026
Xuanwo marked this pull request as ready for review May 28, 2026 06:37
codecov bot commented May 28, 2026

Codecov Report

❌ Patch coverage is 92.30769% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-index/src/scalar/inverted/index.rs | 0.00% | 1 Missing ⚠️ |
| rust/lance-index/src/scalar/inverted/tokenizer.rs | 94.11% | 1 Missing ⚠️ |

westonpace (Member) left a comment

Seems like good rationale. The only thing that makes me slightly hesitant is having `default_base_tokenizer` depend on a feature flag since it might be confusing.

Still, it's in the default feature list, and it's always going to be on for wheels / pylance, so I guess the vast majority of users will never be clearing this flag.

Xuanwo (Collaborator, Author) commented May 28, 2026

> The only thing that makes me slightly hesitant is having `default_base_tokenizer` depend on a feature flag since it might be confusing.

I'm open to just removing this feature flag and making it always enabled.
